Web Scraping in R

Workshop, FF UK, 5 October 2023
Selected Topics in Data Analysis

Renata Topinkova

LMU Munich
📫 renata.topinkova[at]lmu.de

What are we going to cover?

1️⃣ What are APIs?

2️⃣ API with R package 💻

3️⃣ API without dedicated R package 💻

4️⃣ Web scraping

5️⃣ Web scraping practice 💻

All materials available at: https://github.com/renatatopinkova/2023_scraping_FFUK

Downloading materials

https://github.com/renatatopinkova/2023_scraping_FFUK

APIs

What does API stand for?

API = Application Programming Interface

  • designated points of data access

How does it work?

Source: https://www.geeksforgeeks.org/what-is-an-api/


Pros and cons

✅ Legal (mostly)

✅ Structured data

✅ More robust to changes on the web

✅ There may be an R package!

❌ May not be available

❌ May need authentication

❌ Rate limits

❌ May be paid only

❌ Quality of documentation varies

❌ Can be cancelled anytime

Can be cancelled anytime…

On April 4, 2018, the post-API age reached a milestone. On that day, Facebook closed access to its Pages API, which had allowed researchers to extract posts, comments, and associated metadata from public Facebook pages (Schroepfer, 2018). This decision followed the company’s April 2015 closure of its public search Application Programming Interface (API), which provided searchable access to all public posts within a rolling two-week window (Facebook, n.d.). The closure of the Pages API eliminated all terms of service (TOS)-compliant access to Facebook content. Let me underscore the magnitude of this shift: There is currently no way to independently extract content from Facebook without violating its TOS. (Freelon 2018)

Current state of affairs

  • Meta: No API access for researchers to Facebook and Instagram

  • Twitter: paid access only as of February 2023

    • e.g., $100 per month for a maximum of 10,000 tweets
  • TikTok

  • Pushshift API (access to historical Reddit data): closed as of May 2023

    • Reddit API still usable, but max. 1000 results and limited number of calls.

API examples

📱Social media

Reddit

Youtube

Spotify


📰 News

Guardian

NYT


📊 Data sources

OECD

WHO


🌍 Gov

Data.police.uk

Covid data


🏴‍☠️ Unofficial “APIs”

Omdb

Google Trends


➿ Other

GoogleMaps

AccuWeather


and many more

Starting with APIs


Isn’t there an R package for that?

📦 WHO, guardianapi, spotifyR, nytimes, wbstats, RedditExtractoR


Are you sure?

Google, Github


If you’re SURE sure… use a generic package

📦 httr, httr2
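The generic workflow can be sketched with httr2; note that the endpoint and parameters below are made-up placeholders, not a real service:

```r
# Minimal httr2 sketch against a hypothetical API
# (api.example.com and its query parameters are placeholders)
library(httr2)

resp <- request("https://api.example.com/v1/items") |>  # endpoint
  req_url_query(q = "scraping", limit = 10) |>          # query parameters
  req_perform()                                         # send the request

resp_status(resp)             # 200 means success
data <- resp_body_json(resp)  # parse the JSON body into an R list
```

The same pipe-based pattern (build request → add parameters → perform → parse) carries over to any JSON API.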

Starting with APIs


📖 STEP 1: Read the documentation

  • Endpoint = designated point for data collection (often > 1)

  • Parameters = How can I narrow down what I want to get? What can I get? What values does the API accept?

  • Authentication = Do I need an API token? How do I get it? Where do I put it?

  • Rate limits = How much can I download in a minute/day?

  • ToS = What are you allowed to do with the data? Can you publish it? In what form?

If you are using a package 📦, read the package vignette 📖

Documentation

For the API: Can be found on the Developer Platforms in Docs

Google: [name of platform] API

Package documentation

Rate limits

  • Not always known

  • Exceeding rate limits can lead to being blocked or throttling

  • This is usually solved with Sys.sleep() in R, but some packages handle the waiting within their queries
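A common pattern is to pause between calls in a loop; here is a sketch where the queries and the call_api() function are hypothetical placeholders:

```r
# Sketch: pause between requests to respect a rate limit
# (call_api() is a placeholder for your actual API call)
queries <- c("climate", "energy", "housing")
results <- vector("list", length(queries))

for (i in seq_along(queries)) {
  # results[[i]] <- call_api(queries[i])
  Sys.sleep(1)  # wait one second before the next request
}
```

If you build requests with httr2, req_throttle() offers a declarative alternative to manual sleeping.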

Guardian API

The Guardian

Is there an API?

How do I authenticate?

Link

Signing up for a token

Link

Signing up for a token

Project name, e.g.: API retrieval class

Store the key

After signing up, you will receive an email -> verify your address -> receive a key

Store the key as a .txt document in your project directory.
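Reading the key back in R is then a one-liner; the file name key.txt below is an assumption, use whatever you saved it as:

```r
# Sketch: load the API key from a local text file (file name assumed)
api_key <- readLines("key.txt", n = 1)
```

Keep the file out of version control, e.g. by adding key.txt to your .gitignore.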

What can I get?

Explore the data

Preview

Is there a package?

How do I use it in R?

Preview
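Even without a dedicated package, the Guardian's documented /search endpoint can be queried directly with httr2; the search term and the key file name below are example assumptions:

```r
# Sketch: query the Guardian content API /search endpoint directly
library(httr2)

api_key <- readLines("key.txt", n = 1)  # key stored earlier (file name assumed)

resp <- request("https://content.guardianapis.com/search") |>
  req_url_query(q = "web scraping", `api-key` = api_key) |>
  req_perform()

body <- resp_body_json(resp)
articles <- body$response$results  # list of matching articles
```

Backticks around `api-key` are needed because R names cannot normally contain a hyphen.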

Rate limits

Let’s get practical: Switch to R

Open the 01_API_w_package_exercise.qmd file.

25:00

References

Freelon, Deen. 2018. “Computational Research in the Post-API Age.” Political Communication 35 (4): 665–68. https://doi.org/10.1080/10584609.2018.1477506.